33 research outputs found

    Multimodal segmentation of lifelog data

    Get PDF
    A personal lifelog of visual and audio information can be very helpful as a human memory augmentation tool. The SenseCam, a passive wearable camera, used in conjunction with an iRiver MP3 audio recorder, will capture over 20,000 images and 100 hours of audio per week. Used constantly, this quickly builds into a substantial collection of personal data. To gain real value from this collection it is important to automatically segment the data into meaningful units or activities. This paper investigates the optimal combination of data sources for segmenting personal data into such activities. Five data sources were logged and processed to segment a collection of personal data: image processing on captured SenseCam images; audio processing on captured iRiver audio data; and processing of the temperature, white light level, and accelerometer sensors onboard the SenseCam device. The results indicate that a combination of the image, light, and accelerometer sensor data segments our collection of personal data better than a combination of all five data sources. The accelerometer sensor is good for detecting when the user moves to a new location, while the image and light sensors are good for detecting changes in wearer activity within the same location, as well as detecting when the wearer socially interacts with others.
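    The fusion idea in the abstract, scoring each sensor stream for change over time, combining the normalized scores, and declaring a boundary where the combined score peaks, can be sketched as follows. This is a minimal illustration, not the paper's actual method; the window size, weights, and threshold are all hypothetical values.

```python
import numpy as np

def change_scores(stream, win=5):
    """Score each time step by the distance between the mean feature
    vectors of the windows just before and just after it."""
    n = len(stream)
    scores = np.zeros(n)
    for t in range(win, n - win):
        before = stream[t - win:t].mean(axis=0)
        after = stream[t:t + win].mean(axis=0)
        scores[t] = np.linalg.norm(after - before)
    return scores

def segment(streams, weights, threshold=1.5):
    """Fuse z-normalized change scores from several sensor streams and
    return the time indices where the combined score crosses a threshold."""
    fused = np.zeros(len(streams[0]))
    for stream, w in zip(streams, weights):
        sc = change_scores(np.asarray(stream, dtype=float))
        sc = (sc - sc.mean()) / (sc.std() + 1e-9)  # put sources on a common scale
        fused += w * sc
    return np.where(fused > threshold)[0]
```

    With two synthetic streams that both shift level at index 50, `segment` flags indices near 50 as an activity boundary; per-source weights play the role of the source selection the paper evaluates.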

    A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures

    Get PDF
    Subjective similarity between musical pieces and artists is an elusive concept, but one that must be pursued in support of applications to provide automatic organization of large music collections. In this paper, we examine both acoustic and subjective approaches for calculating similarity between artists, comparing their performance on a common database of 400 popular artists. Specifically, we evaluate acoustic techniques based on Mel-frequency cepstral coefficients and an intermediate `anchor space' of genre classification, and subjective techniques which use data from The All Music Guide, from a survey, from playlists and personal collections, and from web-text mining. We find the following: (1) Acoustic-based measures can achieve agreement with ground truth data that is at least comparable to the internal agreement between different subjective sources. However, we observe significant differences between superficially similar distribution modeling and comparison techniques. (2) Subjective measures from diverse sources show reasonable agreement, with the measure derived from co-occurrence in personal music collections being the most reliable overall. (3) Our methodology for large-scale cross-site music similarity evaluations is practical and convenient, yielding directly comparable numbers for different approaches. In particular, we hope that our information-retrieval-based approach to scoring similarity measures, our paradigm of sharing common feature representations, and even our particular dataset of features for 400 artists, will be useful to other researchers.
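    One common member of the family of "distribution modeling and comparison techniques" evaluated in work like this fits a single Gaussian to each artist's frame-level features (e.g. MFCC frames) and compares artists by symmetrized KL divergence. The sketch below illustrates that baseline under those assumptions; it is not necessarily the paper's exact pipeline, and the random frames in the test are stand-ins for real MFCC data.

```python
import numpy as np

def gaussian_model(frames):
    """Fit a single full-covariance Gaussian to per-frame features
    (rows = frames, e.g. MFCC vectors pooled over one artist)."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mu, cov

def kl_gauss(p, q):
    """KL(p || q) between two multivariate Gaussians, in closed form."""
    mu_p, cov_p = p
    mu_q, cov_q = q
    d = len(mu_p)
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff
                  - d + logdet_q - logdet_p)

def artist_distance(frames_a, frames_b):
    """Symmetrized KL between single-Gaussian models of two artists."""
    a, b = gaussian_model(frames_a), gaussian_model(frames_b)
    return kl_gauss(a, b) + kl_gauss(b, a)
```

    The abstract's point (1) is that superficially similar choices here, e.g. single Gaussian versus mixture, KL versus likelihood ratio, can behave quite differently in practice.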

    A Simulation of Vowel Segregation Based on Across-Channel Glottal-Pulse Synchrony

    No full text
    This paper describes what we believe to be a new, additional method to help in separation. We have been particularly struck by the enhanced prominence of sounds with slight frequency modulation compared to completely unmodulated tones, as very plainly demonstrated by McAdams (1984, 1989). Considering the frequency modulation characteristics of natural speech sounds, we speculated that there may be mechanisms in the auditory system that are able to detect the short-term cycle-to-cycle fluctuations in the fundamental period of real speech, and use these to help separate distinct, simultaneous voices. We will describe the algorithm that we have developed to exhibit these qualities, and its effectiveness at this task, which exceeded our preliminary expectations. Due to its dependence on variability over a very short time scale, this approach may be considered a time-domain algorithm in contrast to the harmonic-tracking frequency-domain approaches which have been popular. In section 2, we discuss the nature of pitch-period variation in real speech as motivation for the new approach. In section 3, we briefly review some previous work in vowel separation, then explain the basis of our new technique. Section 4 gives some examples of the preliminary results we have obtained. The issues raised by the model are discussed in section 5. We conclude in section 6 with suggestions concerning how this work might be further validated and developed. 2. A MOTIVATION --- PITCH-PULSE VARIATION IN REAL SPEECH: McAdams made a very compelling demonstration of the capacity of the auditory system to segregate vowels based on differences in fundamental frequency (McAdams 1984, 1989). Three different synthetic vowels with different fundamental frequencies are mixed together. The resulting sound is a den..
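    The quantity motivating this approach, the short-term cycle-to-cycle fluctuation of the fundamental period (jitter), can be measured directly in the time domain on a pulse train. The sketch below measures it on a synthetic signal; it is only an illustration of the motivating quantity, not the paper's separation algorithm, and the threshold-crossing pulse detector is a simplifying assumption.

```python
import numpy as np

def pulse_times(signal, fs, thresh=0.5):
    """Locate pulse onsets as upward crossings of a fraction of the peak."""
    above = signal > thresh * signal.max()
    onsets = np.where(above[1:] & ~above[:-1])[0] + 1
    return onsets / fs

def cycle_jitter(signal, fs):
    """Mean absolute cycle-to-cycle period difference as a fraction of
    the mean period (a standard 'jitter' measure)."""
    t = pulse_times(signal, fs)
    periods = np.diff(t)
    return np.mean(np.abs(np.diff(periods))) / periods.mean()
```

    Applied to a 100 Hz pulse train whose periods are randomly perturbed by a few percent, `cycle_jitter` recovers a value on the order of the injected perturbation.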

    Content-adaptive speech enhancement by a sparsely-activated dictionary plus low rank decomposition

    No full text
    One powerful approach to speech enhancement employs strong models for both speech and noise, decomposing a mixture into the most likely combination. But if the noise encountered differs significantly from the system's assumptions, performance will suffer. In previous work, we proposed a speech enhancement model that decomposes the spectrogram into sparse activation of a dictionary of target speech templates, and a low-rank background model. This makes few assumptions about the noise, and gave appealing results on small excerpts of noisy speech. However, when processing whole conversations, the foreground speech may vary in its complexity and may be unevenly distributed throughout the recording, resulting in inaccurate decompositions for some segments. In this paper, we explore an adaptive formulation of our previous model that incorporates separate side information to guide the decomposition, making it able to better process entire conversations that may exhibit large variations in the speech content.
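    A simplified stand-in for the kind of decomposition described here splits a magnitude spectrogram into a low-rank background plus a sparse foreground by alternating a truncated SVD with soft-thresholding of the residual (a GoDec-style iteration). The paper's model additionally uses a dictionary of speech templates and side information; this sketch keeps only the low-rank-plus-sparse core, and the rank, threshold, and iteration count are hypothetical.

```python
import numpy as np

def lowrank_sparse(M, rank=2, sparse_thresh=0.5, n_iter=30):
    """Split a matrix M (e.g. a magnitude spectrogram) into a low-rank
    background L and a sparse foreground S by alternating a truncated
    SVD with soft-thresholding of the residual."""
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank update: best rank-r approximation of M - S
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse update: soft-threshold the residual M - L
        R = M - L
        S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
    return L, S
```

    On synthetic data built as an exact rank-2 matrix plus a few large spikes, the iteration recovers both parts to good accuracy; on real spectrograms, the rank and threshold govern how much of the speech ends up in S rather than L.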

    Extracting information from music audio

    No full text

    Tandem Connectionist Feature Extraction for Conventional HMM Systems

    No full text
    Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using transformations of these estimates as the base features for a conventionally-trained Gaussian-mixture based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task. 1. INTRODUCTION The standard structure of current speech recognition systems consists of three main stages. First, the sound waveform is passed through feature extracti..
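    The tandem recipe (train a discriminative front end to emit subword posteriors, then transform those posteriors by taking logs and decorrelating them into features for a conventional Gaussian back end) can be sketched as follows. Here a multinomial logistic regression stands in for the neural network and a single diagonal Gaussian per class stands in for the HMM/GMM back end; both are simplifying assumptions, not the paper's actual systems.

```python
import numpy as np

def posteriors(X, W):
    """Per-class posterior probabilities from the trained front end."""
    Z = X @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.5, n_iter=300):
    """Discriminative front end (multinomial logistic regression as a
    stand-in for the MLP), trained to output class posteriors."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(n_iter):
        P = posteriors(X, W)
        W -= lr * X.T @ (P - Y) / len(X)
    return W

def tandem_transform(X, W, mean, Vt):
    """Tandem features: log posteriors, centered and rotated by a PCA
    basis (mean, Vt) estimated on the training log posteriors."""
    logp = np.log(posteriors(X, W) + 1e-8)
    return (logp - mean) @ Vt.T

def gaussian_classify(F_train, y, F_test, n_classes):
    """Conventional back end: one diagonal Gaussian per class over the
    tandem features, classifying by log-likelihood."""
    scores = []
    for c in range(n_classes):
        Fc = F_train[y == c]
        mu, var = Fc.mean(axis=0), Fc.var(axis=0) + 1e-6
        ll = -0.5 * (((F_test - mu) ** 2) / var + np.log(var)).sum(axis=1)
        scores.append(ll)
    return np.argmax(np.stack(scores, axis=1), axis=1)
```

    In use, the PCA basis is fit once on the training log posteriors (SVD of the centered matrix) and applied unchanged to test data, so the back end always sees decorrelated features.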